HLT & NLP within the Arabic world: Arabic Language and local languages processing: Status Updates and Prospects. Workshop date: Saturday 31 May 2008. Workshop chair
Authors
Abstract
Automatic versus interactive analysis for the massive vowelization, tagging and lemmatization of Arabic
Fathi Debili, Zied Ben Tahar, LLACAN, INALCO, CNRS, France and Emna Souissi, ESSTT, Tunisia
How can we produce annotated texts massively with optimal efficiency, reproducibility and cost? Instead of correcting the output of automatic analysis with dedicated tools, as is currently suggested, we found it more advisable to use interactive analysis tools, in which manual editing is fed in real time into the automatic analysis. We address the issue of evaluating these tools, along with their performance in terms of linguistic ergonomics, and propose a metric that calculates the cost of editing as a number of keystrokes and mouse clicks. Using a simple protocol covering Arabic vowelization, tagging and lemmatization, we find that, surprisingly, a system's best interactive performance is not always correlated with its best automatic performance. In other words, the best-performing automatic linguistic behavior of a system does not always yield the best interactive behavior when manual editing is involved.

Prague Arabic Dependency Treebank: A Word on the Million Words
Otakar Smrz, Viktor Bielicky, Iveta Kourilova, Jakub Kracmar, Jan Hajic, Petr Zemanek, Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic
The Prague Arabic Dependency Treebank (PADT) consists of refined multi-level linguistic annotations over the language of Modern Written Arabic. The kind of morphological and syntactic information contained in PADT differs considerably from that of the Penn Arabic Treebank (PATB). This paper explores the possibility of merging the two treebanks into a uniform resource that would exceed the existing ones in level of linguistic detail, accuracy, and quantity. It gives an overview of the character of PADT and its motivations, and reports on converting and enhancing the PATB data. The merged, rule-checked and revised annotations, which amount to over one million words, as well as the open-source computational tools developed in the project, are being considered for publication this year.

Arabic Named Entity Recognition using Conditional Random Fields
Yassine Benajiba and Paolo Rosso, Natural Language Engineering Lab., Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Spain
The Named Entity Recognition (NER) task consists of identifying and classifying proper names within open-domain text. This Natural Language Processing task has proved to be harder for languages with a complex morphology, such as Arabic. NER has also been shown to help Natural Language Processing tasks such as Machine Translation, Information Retrieval and Question Answering achieve higher performance. In our previous work we presented the first and second versions of ANERsys, an Arabic Named Entity Recognition system whose performance we improved by more than 10 points from the first version to the second by adopting a different architecture and using additional information such as Part-Of-Speech tags and Base Phrase Chunks. In this paper, we present a further attempt to enhance the accuracy of ANERsys by changing the probabilistic model from Maximum Entropy to Conditional Random Fields, which improved the results significantly.

Can the building of corpus-based Arabic concordances with AraConc and DIINAR.1 tackle the issue of Arabic polyglossia?
Joseph Dichy and Ramzi Abbès, Université Lumière-Lyon 2 and ICAR (CNRS-Lyon 2)
Research and development in Arabic NLP has nothing to gain from an oversimplified view of the Arabic language, considering the high level of linguistic variation commonly observed in the Arabic language as a whole, including Arabic vernaculars (or ‘Dialects’). The real challenge is both to keep hold of the current state of knowledge in the description of linguistic variation (complex as it may seem, it is itself a simplified representation when compared to real-world language use) and, at the same time, to seek efficient formalized approaches that allow software applications to operate at a sufficient level of granularity. In this paper, we try to show how a reasonable level of granularity, and hence of feasibility, can be reached through a balance between the complexity of Arabic corpus data and present-day tools that were originally devised for Modern Standard Arabic (MSA). Section 1 introduces the complex system of communicative competence in Arabic (Arabic polyglossia). Section 2 suggests a mapping of what could become, in the future, a real-world hyper-base of corpora in the Arabic language. Section 3 is an attempt to contribute to the above challenge through the use of tools initially devised for MSA. These are the DIINAR.1 knowledge database (DIctionnaire INformatisé de l’ARabe, version 1) and the most recent of the analyzers based on it, the AraConc concordance software, which was devised to extract concordances and frequency lists in Modern Standard Arabic. Two types of actual results are made available: statistical results and concordances. Examples are given of how these tools can meet linguistic variation in Arabic and begin to evolve towards a new generation of tools.
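
As a rough illustration of the two kinds of output mentioned in the last abstract (frequency lists and concordances), the following Python sketch builds both from a plain token list. It is not AraConc's method and does not use DIINAR.1: the function names, the whitespace tokenizer and the context width are all assumptions made for this example, and a real Arabic tool would rely on morphological analysis rather than surface forms.

```python
from collections import Counter

def tokenize(text):
    # Assumption: plain whitespace tokenization on surface forms.
    # Real Arabic processing would segment clitics and lemmatize.
    return text.split()

def frequency_list(tokens):
    # Return (token, count) pairs, most frequent first.
    return Counter(tokens).most_common()

def concordance(tokens, keyword, width=2):
    # Keyword-in-context lines: (left context, keyword, right context),
    # with up to `width` tokens on each side.
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

if __name__ == "__main__":
    sample = "كتب الولد الدرس ثم كتب الرسالة"
    toks = tokenize(sample)
    print(frequency_list(toks))
    for left, kw, right in concordance(toks, "كتب", width=1):
        print(f"{left} [{kw}] {right}")
```

Matching on exact surface forms means that inflected or cliticized variants of a word are counted separately, which is precisely the limitation that a lexical database is meant to address.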
Similar resources
Arabic Dialect Processing Tutorial
The existence of dialects for any language constitutes a challenge for NLP in general since it adds another set of variation dimensions from a known standard. The problem is particularly interesting and challenging in Arabic and its different dialects, where the diversion from the standard could, in some linguistic views, warrant a classification as different languages. This problem would not b...
Natural Language Processing for Less Privileged Languages: Where do we come from? Where are we going?
In the context of the IJCNLP workshop on Natural Language Processing (NLP) for Less Privileged Languages, we discuss the obstacles to research on such languages. We also briefly discuss the ways to make progress in removing these obstacles. We mention some previous work and comment on the papers selected for the workshop.
A New Method for Extracting Named Entities in Classical Arabic
In Natural Language Processing (NLP) studies, developing resources and tools contributes to the breadth and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted the attention of NLP researchers because of its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...
ACL-08: HLT Software Engineering, Testing, and Quality Assurance for Natural Language Processing
Software engineering in general is a first-class research object in computer science, but generally has not been treated as such within the natural language processing community. This is despite the fact that natural language as an input type has unique characteristics that present special problems for software testing, quality assurance, and even requirements specification. The goals of this w...